Hey GPT-OSS, Looks Like You Got It -- Now Walk Me Through It! An Assessment of the Reasoning Language Models Chain of Thought Mechanism for Digital Forensics

Michelet, Gaëtan, Schneider, Janine, Withanage, Aruna, Breitinger, Frank

arXiv.org Artificial Intelligence

The use of large language models in digital forensics has been widely explored. Beyond identifying potential applications, research has also focused on optimizing model performance for forensic tasks through fine-tuning. However, limited result explainability reduces their operational and legal usability. Recently, a new class of reasoning language models has emerged, designed to handle logic-based tasks through an `internal reasoning' mechanism. Yet, users typically see only the final answer, not the underlying reasoning. One of these reasoning models is gpt-oss, which can be deployed locally, providing full access to its underlying reasoning process. This article presents the first investigation into the potential of reasoning language models for digital forensics. Four test use cases are examined to assess the usability of the reasoning component in supporting result explainability. The evaluation combines a new quantitative metric with qualitative analysis. Findings show that the reasoning component aids in explaining and validating language model outputs in digital forensics at medium reasoning levels, but this support is often limited, and higher reasoning levels do not enhance response quality.
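
Local deployment matters here because gpt-oss emits its reasoning in a separate channel of its chat format. As a rough sketch of how an examiner might separate the auditable reasoning from the final answer, assuming harmony-style channel markers (`<|channel|>analysis<|message|>` and friends; the exact serialization depends on the inference server):

```python
import re

def split_channels(raw: str) -> dict:
    """Group harmony-formatted model output by channel name.

    Assumes gpt-oss-style markers: each message looks like
    <|start|>assistant<|channel|>NAME<|message|>TEXT<|end|> (or <|return|>).
    """
    pattern = re.compile(
        r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>)",
        re.DOTALL,
    )
    channels: dict[str, list[str]] = {}
    for name, text in pattern.findall(raw):
        channels.setdefault(name, []).append(text.strip())
    return channels

# hypothetical forensic exchange, for illustration only
raw = (
    "<|start|>assistant<|channel|>analysis<|message|>"
    "The hash matches entry 3 of the seized image.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>"
    "File X was present on the device.<|return|>"
)
out = split_channels(raw)
print(out["analysis"][0])  # the chain of thought an examiner can audit
print(out["final"][0])     # the answer a user would normally see
```

Keeping the two channels apart is what makes the reasoning usable as supporting evidence: the analysis text can be reviewed and challenged independently of the answer itself.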


gpt-oss-120b & gpt-oss-20b Model Card

OpenAI: Agarwal, Sandhini, Ahmad, Lama, Ai, Jason, Altman, Sam, Applebaum, Andy, Arbus, Edwin, Arora, Rahul K., Bai, Yu, Baker, Bowen, Bao, Haiming, Barak, Boaz, Bennett, Ally, Bertao, Tyler, Brett, Nivedita, Brevdo, Eugene, Brockman, Greg, Bubeck, Sebastien, Chang, Che, Chen, Kai, Chen, Mark, Cheung, Enoch, Clark, Aidan, Cook, Dan, Dukhan, Marat, Dvorak, Casey, Fives, Kevin, Fomenko, Vlad, Garipov, Timur, Georgiev, Kristian, Glaese, Mia, Gogineni, Tarun, Goucher, Adam, Gross, Lukas, Guzman, Katia Gil, Hallman, John, Hehir, Jackie, Heidecke, Johannes, Helyar, Alec, Hu, Haitang, Huet, Romain, Huh, Jacob, Jain, Saachi, Johnson, Zach, Koch, Chris, Kofman, Irina, Kundel, Dominik, Kwon, Jason, Kyrylov, Volodymyr, Le, Elaine Ya, Leclerc, Guillaume, Lennon, James Park, Lessans, Scott, Lezcano-Casado, Mario, Li, Yuanzhi, Li, Zhuohan, Lin, Ji, Liss, Jordan, Liu, Lily, Liu, Jiancheng, Lu, Kevin, Lu, Chris, Martinovic, Zoran, McCallum, Lindsay, McGrath, Josh, McKinney, Scott, McLaughlin, Aidan, Mei, Song, Mostovoy, Steve, Mu, Tong, Myles, Gideon, Neitz, Alexander, Nichol, Alex, Pachocki, Jakub, Paino, Alex, Palmie, Dana, Pantuliano, Ashley, Parascandolo, Giambattista, Park, Jongsoo, Pathak, Leher, Paz, Carolina, Peran, Ludovic, Pimenov, Dmitry, Pokrass, Michelle, Proehl, Elizabeth, Qiu, Huida, Raila, Gaby, Raso, Filippo, Ren, Hongyu, Richardson, Kimmy, Robinson, David, Rotsted, Bob, Salman, Hadi, Sanjeev, Suvansh, Schwarzer, Max, Sculley, D., Sikchi, Harshit, Simon, Kendal, Singhal, Karan, Song, Yang, Stuckey, Dane, Sun, Zhiqing, Tillet, Philippe, Toizer, Sam, Tsimpourlas, Foivos, Vyas, Nikhil, Wallace, Eric, Wang, Xin, Wang, Miles, Watkins, Olivia, Weil, Kevin, Wendling, Amy, Whinnery, Kevin, Whitney, Cedric, Wong, Hannah, Yang, Lin, Yang, Yu, Yasunaga, Michihiro, Ying, Kristen, Zaremba, Wojciech, Zhan, Wenting, Zhang, Cyril, Zhang, Brian, Zhang, Eddie, Zhao, Shengjia

arXiv.org Artificial Intelligence

We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, Python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
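
The mixture-of-experts design keeps inference cheap by running only a few experts per token. A minimal sketch of the sparse top-k routing idea (a toy illustration of the general technique, not gpt-oss's actual router):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, gate_w, experts, k=2):
    """Sparse mixture-of-experts layer: route input x to the top-k experts.

    gate_w: one weight vector per expert (toy linear router).
    experts: list of callables; only k of them run per token, which is
    what keeps inference cost low despite a large total parameter count.
    """
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in gate_w]
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])  # renormalize over chosen experts
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

# toy setup: 4 experts, each just scales the input by a constant
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
y, chosen = moe_forward([1.0, 0.5], gate_w, experts, k=2)
```

The output is a gate-weighted blend of the two selected experts; the other experts contribute no compute at all for this input.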


LLM-Stackelberg Games: Conjectural Reasoning Equilibria and Their Applications to Spearphishing

Zhu, Quanyan

arXiv.org Artificial Intelligence

We introduce the framework of LLM-Stackelberg games, a class of sequential decision-making models that integrate large language models (LLMs) into strategic interactions between a leader and a follower. Departing from classical Stackelberg assumptions of complete information and rational agents, our formulation allows each agent to reason through structured prompts, generate probabilistic behaviors via LLMs, and adapt their strategies through internal cognition and belief updates. We define two equilibrium concepts: reasoning and behavioral equilibrium, which aligns an agent's internal prompt-based reasoning with observable behavior, and conjectural reasoning equilibrium, which accounts for epistemic uncertainty through parameterized models over an opponent's response. These layered constructs capture bounded rationality, asymmetric information, and meta-cognitive adaptation. We illustrate the framework through a spearphishing case study, where a sender and a recipient engage in a deception game using structured reasoning prompts. This example highlights the cognitive richness and adversarial potential of LLM-mediated interactions. Our results show that LLM-Stackelberg games provide a powerful paradigm for modeling decision-making in domains such as cybersecurity, misinformation, and recommendation systems.
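
For contrast with the LLM-mediated version, the classical complete-information Stackelberg solution that the paper departs from can be computed by simple enumeration. A toy sketch with made-up payoffs:

```python
def stackelberg(leader_payoff, follower_payoff):
    """Classical Stackelberg solution by enumeration.

    payoff[i][j]: payoff when the leader plays i and the follower plays j.
    The leader commits first, anticipating the follower's best response;
    this fully rational baseline is what the LLM-Stackelberg framework
    relaxes with prompt-based reasoning and belief updates.
    """
    def best_response(i):
        row = follower_payoff[i]
        return max(range(len(row)), key=row.__getitem__)

    best = max(range(len(leader_payoff)),
               key=lambda i: leader_payoff[i][best_response(i)])
    return best, best_response(best)

# toy 2x2 deception game (illustrative payoffs, not from the paper)
L = [[3, 1],
     [4, 0]]
F = [[1, 2],
     [2, 1]]
i, j = stackelberg(L, F)
```

Here the leader prefers row 1 even though row 0 has a higher payoff against a naive follower, because commitment lets it steer the follower's best response.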


MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration

He, Zhitao, Polisetty, Sandeep, Fan, Zhiyuan, Huang, Yuchen, Wu, Shujin, Fung, Yi R.

arXiv.org Artificial Intelligence

In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chains) inference. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, failing to assess confidence in each reasoning step and leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge-boundary awareness of MLLMs through reasoning-step confidence calibration. To achieve this, we propose incorporating complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on these self-rewarded confidence estimation signals for an initial confidence-expression warm-up, we introduce a reinforcement learning stage with multiple reward functions to further align model knowledge and calibrate confidence at each reasoning step, enhancing reasoning-chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average 7.5% reduction in multimodal confidence calibration errors and up to an 8.3% improvement in task performance.
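
Step-level calibration can be measured with the standard expected calibration error (ECE), applied per reasoning step rather than per response. A minimal sketch (the paper's actual reward functions are more involved):

```python
def ece(confidences, correct, n_bins=10):
    """Expected calibration error: |accuracy - mean confidence| per
    confidence bin, weighted by bin size. Scoring each reasoning step
    (rather than only the final answer) exposes the overconfident
    intermediate steps that cause hallucination snowballing.
    """
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(ok for _, ok in b) / len(b)
        conf = sum(c for c, _ in b) / len(b)
        err += len(b) / total * abs(acc - conf)
    return err

# toy example: a model that is overconfident on its reasoning steps
steps_conf = [0.95, 0.9, 0.9, 0.85]
steps_ok = [True, False, True, False]
score = ece(steps_conf, steps_ok)
```

A well-calibrated chain would drive this score toward zero: steps asserted with 90% confidence would be right about 90% of the time.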


Large Language Model Strategic Reasoning Evaluation through Behavioral Game Theory

Jia, Jingru, Yuan, Zehua, Pan, Junhao, McNamara, Paul E., Chen, Deming

arXiv.org Artificial Intelligence

Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory, disentangling reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games, yet model scale alone does not determine performance. In terms of prompting enhancement, Chain-of-Thought (CoT) prompting is not universally effective: it increases strategic reasoning only for models at certain levels and provides limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features on the models, observing that certain assignments alter decision-making patterns. For instance, GPT-4o shows stronger strategic reasoning with female traits than with male ones, while Gemma assigns higher reasoning levels to heterosexual identities than to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.
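
The reasoning levels assigned here come from level-k models of behavioral game theory: in the classic p-beauty contest, for example, each level best responds to the level below it. A minimal sketch:

```python
def level_k_guess(k, p=2 / 3, level0=50.0):
    """Level-k play in the p-beauty contest (guess p times the average).

    A level-0 player guesses at random (mean 50); a level-k player best
    responds to a population of level-(k-1) players, giving 50 * p**k.
    Behavioral evaluations infer which k a model's choices imply.
    """
    guess = level0
    for _ in range(k):
        guess *= p
    return guess

# deeper reasoning pushes guesses toward the Nash equilibrium of 0
ladder = [round(level_k_guess(k), 2) for k in range(4)]
```

A model guessing near 22 in this game behaves like a level-2 reasoner, while a guess near 0 signals equilibrium play; the framework uses this kind of gradation instead of a binary NE check.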


Dynamic Game-Theoretical Decision-Making Framework for Vehicle-Pedestrian Interaction with Human Bounded Rationality

Dang, Meiting, Zhao, Dezong, Wang, Yafei, Wei, Chongfeng

arXiv.org Artificial Intelligence

Human-involved interactive environments pose significant challenges for autonomous vehicle decision-making due to the complexity and uncertainty of human behavior. It is crucial to develop an explainable and trustworthy decision-making system for autonomous vehicles interacting with pedestrians. Previous studies often used traditional game theory to describe interactions because of its interpretability. However, it assumes complete human rationality and unlimited reasoning ability, which is unrealistic. To address this limitation and improve model accuracy, this paper proposes a novel framework that integrates the partially observable Markov decision process with behavioral game theory to dynamically model AV-pedestrian interactions at unsignalized intersections. Both the AV and the pedestrian are modeled as dynamic-belief-induced quantal cognitive hierarchy (DB-QCH) models, accounting for human reasoning limitations and bounded rationality in the decision-making process. In addition, a dynamic belief-updating mechanism allows the AV to update its understanding of the opponent's degree of rationality in real time based on observed behaviors and to adapt its strategies accordingly. The analysis results indicate that our models effectively simulate vehicle-pedestrian interactions and that our proposed AV decision-making approach performs well in terms of safety, efficiency, and smoothness. It closely resembles real-world driving behavior and even achieves more comfortable navigation than our previous virtual reality experimental data.
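
The quantal part of the DB-QCH model replaces exact best response with a noisy, logit ("quantal") choice rule. A minimal sketch of that rule (illustrative utilities, not the paper's):

```python
import math

def quantal_response(utilities, lam):
    """Logit quantal response: choice probabilities proportional to
    exp(lam * utility). lam = 0 gives uniform random play (fully bounded
    rationality); lam -> infinity recovers the rational best response.
    QCH models combine this rule with a distribution over reasoning levels.
    """
    m = max(utilities)
    exps = [math.exp(lam * (u - m)) for u in utilities]
    s = sum(exps)
    return [e / s for e in exps]

# a pedestrian choosing between "cross" (utility 1.0) and "wait" (0.5)
probs_bounded = quantal_response([1.0, 0.5], lam=1.0)
probs_rational = quantal_response([1.0, 0.5], lam=50.0)
```

With a low rationality parameter the pedestrian still waits fairly often despite crossing being better; as the parameter grows, behavior collapses onto the strict best response, which is the unrealistic limit the paper moves away from.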


SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese

Xu, Liang, Xue, Hang, Zhu, Lei, Zhao, Kangkang

arXiv.org Artificial Intelligence

We introduce SuperCLUE-Math6 (SC-Math6), a new benchmark dataset to evaluate the mathematical reasoning abilities of Chinese language models. SC-Math6 is designed as an upgraded Chinese version of the GSM8K dataset with enhanced difficulty, diversity, and application scope. It consists of over 2000 mathematical word problems requiring multi-step reasoning and providing natural language solutions. We propose an innovative scheme to quantify the reasoning capability of large models based on performance over problems with different reasoning steps. Experiments on 13 representative Chinese models demonstrate a clear stratification of reasoning levels, with top models like GPT-4 showing superior performance. SC-Math6 fills the gap in Chinese mathematical reasoning benchmarks and provides a comprehensive testbed to advance the intelligence of Chinese language models.
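
The abstract does not spell out the scoring scheme, but one simple instantiation of grading by reasoning steps is to weight per-step-count accuracy by the number of steps required. A hypothetical sketch, not SC-Math6's exact formula:

```python
def reasoning_score(results):
    """Aggregate accuracy over problems graded by reasoning-step count.

    results: {steps: (n_correct, n_total)}. Weighting each accuracy by
    the number of required steps rewards models that stay accurate on
    harder multi-step problems; this is a stand-in illustration, as the
    benchmark's exact scheme is not given in the abstract.
    """
    num = sum(steps * c / t for steps, (c, t) in results.items())
    den = sum(results)  # sum of the step counts (dict keys)
    return num / den

# hypothetical model: accuracy degrades as reasoning chains lengthen
results = {1: (9, 10), 2: (8, 10), 3: (6, 10)}
score = reasoning_score(results)
```

A model that only handles one-step problems scores poorly under this weighting even with perfect one-step accuracy, which is the stratification effect the benchmark is after.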


Modeling Human Driver Interactions Using an Infinite Policy Space Through Gaussian Processes

Yaldiz, Cem Okan, Yildiz, Yildiray

arXiv.org Artificial Intelligence

This paper proposes a method for modeling human driver interactions that relies on multi-output Gaussian processes. The proposed method is developed as a refinement of the game-theoretical hierarchical reasoning approach called "level-k reasoning", which conventionally assigns discrete levels of behavior to agents. Although it is shown to be an effective modeling tool, the level-k reasoning approach may pose undesired constraints for predicting human decision-making due to the limited number (usually 2 or 3) of driver policies it extracts. The proposed approach fills this gap in the literature by introducing a continuous-domain framework that enables an infinite policy space. By using the approach presented in this paper, more accurate driver models can be obtained, which can then be employed to create high-fidelity simulation platforms for the validation of autonomous vehicle control algorithms. The proposed method is validated on a real traffic dataset and compared with the conventional level-k approach to demonstrate its contributions and implications.


Reinforcement Learning with Iterative Reasoning for Merging in Dense Traffic

Bouton, Maxime, Nakhaei, Alireza, Isele, David, Fujimura, Kikuo, Kochenderfer, Mykel J.

arXiv.org Artificial Intelligence

In recent years, major progress has been made to deploy autonomous vehicles and improve safety. However, certain common driving situations like merging in dense traffic are still challenging for autonomous vehicles. Situations like the one illustrated in Figure 1 often involve negotiating with human drivers. To avoid the computational requirements of online methods, we can use reinforcement learning (RL) instead. In RL, the agent interacts with a simulation environment many times prior to execution, and at each simulation episode it improves its strategy. The resulting policy can then be deployed online and is often inexpensive to evaluate. RL provides a flexible framework to automatically find good
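
The episode loop described here, in which an agent repeatedly interacts with a simulator and improves its strategy before deployment, is captured by tabular Q-learning. A toy sketch on a made-up merging MDP (not the paper's environment):

```python
import random

def step(state, action):
    """Toy merging MDP: state counts how long we have waited for a gap.
    'merge' (1) succeeds (+1) only once a gap is open (state 2), else -1;
    'wait' (0) costs a little and advances toward an open gap."""
    if action == 1:                    # merge: episode ends
        return None, (1.0 if state == 2 else -1.0)
    return min(state + 1, 2), -0.1     # wait

def q_learning(episodes=2000, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """Learn Q-values entirely in simulation; the greedy policy that
    results is cheap to evaluate online, as the abstract notes."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s is not None:
            if rng.random() < eps:                     # explore
                a = rng.choice((0, 1))
            else:                                      # exploit
                a = max((0, 1), key=lambda a: q[(s, a)])
            s2, r = step(s, a)
            target = r if s2 is None else r + gamma * max(q[(s2, 0)], q[(s2, 1)])
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

q = q_learning()
policy = {s: max((0, 1), key=lambda a: q[(s, a)]) for s in range(3)}
```

After training, the greedy policy waits until the gap opens and then merges, i.e. it has learned the negotiation timing without any online planning at execution time.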